Value Added Tagging for Multilingual Resource Management
Authors
Abstract
The Legebiduna project brings together state-of-the-art techniques in multilingual corpus management, generic mark-up, text segmentation and alignment, terminological extraction, automatic text cataloguing, and reutilisation of recurrent text in specialised documentation. We report on the experience of a four-year project of bilingual corpus mining in a dedicated domain of official bilingual publications. Considerable effort has gone into developing tools for the automatic processing of a collected parallel corpus of 7 million words in both Spanish and Basque. Experiments have been undertaken on a half-million-word sample of the corpus, and the results are very satisfactory. Legebiduna has now become a prototype of a domain-expert editing tool that helps both institutional writers and translators to carry out their work in an optimal computer-oriented environment.
Similar resources
Multilingual Projection for Parsing Truly Low-Resource Languages
We propose a novel approach to cross-lingual part-of-speech tagging and dependency parsing for truly low-resource languages. Our annotation projection-based approach yields tagging and parsing models for over 100 languages. All that is needed are freely available parallel texts, and taggers and parsers for resource-rich languages. The empirical evaluation across 30 test languages shows that our...
Tags and self-organisation: a metadata ecology for learning resources in a multilingual context
Social tags offer a novel perspective for studying learning resources, their metadata, and how users interact with them. The key theme in this research is to understand the central role of social tagging for Technology Enhanced Learning (TEL), more specifically, for digital learning resources in a multilingual context. The main hypothesis is that the self-organisation aspect of a social tagging system on a...
What Can We Get From 1000 Tokens? A Case Study of Multilingual POS Tagging For Resource-Poor Languages
In this paper we address the problem of multilingual part-of-speech tagging for resource-poor languages. We use parallel data to transfer part-of-speech information from resource-rich to resource-poor languages. Additionally, we use a small amount of annotated data to learn to “correct” errors from the projection approach, such as tagset mismatch between languages, achieving state-of-the-art performan...
Unsupervised Multilingual Learning for POS Tagging
We demonstrate the effectiveness of multilingual learning for unsupervised part-of-speech tagging. The key hypothesis of multilingual learning is that by combining cues from multiple languages, the structure of each becomes more apparent. We formulate a hierarchical Bayesian model for jointly predicting bilingual streams of part-of-speech tags. The model learns language-specific features while ...
Corporate Language Resources In Multilingual Content Creation, Maintenance And Leverage
This paper focuses on how language resources (LR) for translation (hence LR4Trans) feature, and should ideally feature, within a corporate workflow of multilingual content development. The envisaged scenario is a content management system that acknowledges the value of LR4Trans in the organisation as a key component and corporate knowledge resource.